Classifying and alleviating the communication overheads in matrix computations on large-scale NUMA multiprocessors

Authors

Yi-Min Wang

Hsiao-Hsi Wang

Ruei-Chuan Chang

Abstract

Large-scale, shared-memory multiprocessors have non-uniform memory access (NUMA) costs. The high communication cost dominates the source of matrix computations' execution. Memory contention and remote memory access are two major communication overheads on large-scale NUMA multiprocessors. However, previous experiments and discussions focus either on reducing the number of remote memory accesses or on alleviating memory contention overhead. In this paper, we propose a simple but effective processor allocation policy, called rectangular processor allocation, to alleviate both overheads at the same time. The policy divides the matrix elements into a certain number of rectangular blocks, and assigns each processor to compute the results of one rectangular block. This methodology may reduce a lot of unnecessary memory accesses to the memory modules. After running many matrix computations under a realistic memory system simulator, we confirmed that at least one-fourth of the communication overhead may be reduced. Therefore, we conclude that rectangular processor allocation policy performs better than other popular policies, and that the combination of rectangular processor allocation policy with software interleaving data allocation policy is a better choice to alleviate communication overhead. © 1998 Elsevier Science Inc. All rights reserved.
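The block assignment described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the number or shape of the blocks is chosen, so everything below is an assumption made for illustration: the helper names (`grid_shape`, `split`, `rectangular_allocation`, `multiply_block`) are hypothetical, the processor grid is taken to be as close to square as possible, and matrix multiplication stands in for the matrix computations mentioned. It is not the authors' implementation and it does not model memory modules or contention.

```python
import math


def grid_shape(num_procs):
    """Pick (rows, cols) with rows * cols == num_procs, as close to square as possible.

    Assumption: the paper's actual block-shape rule is not given in the abstract.
    """
    rows = math.isqrt(num_procs)
    while num_procs % rows != 0:
        rows -= 1
    return rows, num_procs // rows


def split(n, parts):
    """Split range(n) into `parts` contiguous half-open intervals of near-equal size."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def rectangular_allocation(n_rows, n_cols, num_procs):
    """Map each processor id to one block ((r0, r1), (c0, c1)) of the result matrix."""
    pr, pc = grid_shape(num_procs)
    row_bounds = split(n_rows, pr)
    col_bounds = split(n_cols, pc)
    return {i * pc + j: (row_bounds[i], col_bounds[j])
            for i in range(pr) for j in range(pc)}


def multiply_block(a, b, block):
    """Compute the entries of the product a*b that fall inside one block.

    `a` and `b` are plain lists of lists; only rows r0..r1 of `a` and
    columns c0..c1 of `b` are read.
    """
    (r0, r1), (c0, c1) = block
    inner = len(b)
    return {(i, j): sum(a[i][k] * b[k][j] for k in range(inner))
            for i in range(r0, r1) for j in range(c0, c1)}
```

With these assumptions, `rectangular_allocation(8, 8, 4)` gives each of four processors one 4 x 4 block of the result. In a matrix product, the processor that owns a block reads only the matching rows of the first operand and the matching columns of the second, which is one plausible reading of how a rectangular block limits the memory modules a processor has to touch.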

Similar resources

Clustered affinity scheduling on large-scale NUMA multiprocessors

Modern shared-memory multiprocessors have high and non-uniform memory access (NUMA) costs. The communication cost gradually dominates the source of parallel applications’ execution. Algorithms based on affinity, like affinity scheduling algorithm (AFS), perform better than dynamic algorithms, such as guided self-scheduling (GSS) and trapezoid self-scheduling (TSS). However, as the number of proc...

Experiences with Data Distribution on NUMA Shared Memory Multiprocessors

The choice of a good data distribution scheme is critical to performance of data-parallel applications on both distributed memory multiprocessors and NUMA shared memory multiprocessors. The high cost of interprocessor communication in distributed memory multiprocessors makes the minimization of communications the predominant issue in selecting data distribution schemes. However, on NUMA multipro...

Hierarchical loop scheduling for clustered NUMA machines

Loop scheduling is an important issue in the development of high performance multiprocessors. As modern multiprocessors have high and non-uniform memory access (NUMA) costs, the communication costs dominate the execution of parallel programs. Previous affinity algorithms perform better than dynamic algorithms under non-clustered NUMA multiprocessors, but they suffer heavy overheads when migrating ...

Memory Latency Reduction with Fine-grain Migrating Threads in Numa Shared-memory Multiprocessors

In order to fully realize the potential performance benefits of large-scale NUMA shared memory multiprocessors, efficient techniques to reduce/tolerate long memory access latencies in such systems are to be developed. This paper discusses the concept, software and hardware support for memory latency reduction through fine-grain non-transparent thread migration, referred to as mobile multithread...

Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors

The invention, acceptance, and proliferation of multiprocessors are primarily a result of the quest to increase computer system performance. The most promising features of multiprocessors are their potential to solve problems faster than previously possible and to solve larger problems than previously possible. Large-scale multiprocessors offer the additional advantage of being able to execute ...

Journal:

Journal of Systems and Software

Volume: 44

Publication date: 1998
